P3 - Data Wrangling with MongoDB

OpenStreetMap Project Data Wrangling with MongoDB

Gangadhara Naga Sai

Data used - MapZen Weekly OpenStreetMaps Metro Extracts

Map Areas: These two maps are selected since ,right now i am living at Hoodi,Bengaluru. And my dream is to do my masters in japan in robotics,so i had selected locality of University of tokyo, Bunkyo.I really wanted to explore differences between the regions.



**1. Problems Encountered**

  • Some of names were in different Languages so ,i had to filter out them and select english names for both maps Hoodi and Bunkyo
  • Street names with different types of abbreviations. (i.e. 'Clark Ave SE' or 'Eubank Northeast Ste E-18')
  • Two cities have to be accessed from one database

Names in Different Language

Different regions have different languages ,and we find that someof names were in different language which are filltered to get only english names. Which would check weather the charecters belong to ascii or not


In [1]:
def isEnglish(string):
    try:
        string.encode('ascii')
    except UnicodeEncodeError:
        return False
    else:
        return True

Over-­abbreviated Names

Since the most of data being manually uploaded, there are lot of abbreviations in street names,locality names. Where they are filtered and replaced with full names.


In [ ]:
#the city below can be hoodi or bunkyo
for st_type, ways in city_types.iteritems():
    for name in ways:
        better_name = update_name(name, mapping)
        if name != better_name:
            print name, "=>", better_name

In [ ]:
#few examples 
Bunkyo:
Meidai Jr. High Sch. => Meidai Junior High School
St. Mary's Cathedral => Saint Mary's Cathedral
Shinryukei brdg. E. => Shinryukei Bridge East
Iidabashi Sta. E. => Iidabashi Station East
...

Hoodi:
St. Thomas School => Saint Thomas School
Opp. Jagrithi Apartment => Opposite Jagrithi Apartment
...

Merging Both cities

These two maps are selected since ,right now i am living at Hoodi,Bengaluru. And one day i want do my masters in japan in robotics,so i had selected locality of University of tokyo, Bunkyo.I really wanted to explore differences between the regions.

I need to add a tag named "city" so i can differentiate them from the database.


2. Data Overview

This section contains basic statistics about the dataset and the MongoDB queries used to gather them.

File Sizes


In [ ]:
bangalore.osm -40MB
bangalore.osm.json-51MB
tokyo1.osm- 82MB
tokyo1.osm.json-102.351MB

Number of documents


In [6]:
print "Bunkyo:",mongo_db.cities.find({'city':'bunkyo'}).count()
print "Hoodi:",mongo_db.cities.find({'city':'hoodi'}).count()
  • Bunkyo: 1268292
  • Hoodi: 667842

Number of node nodes.


In [ ]:
print "Bunkyo:",mongo_db.cities.find({"type":"node",
                                    'city':'bunkyo'}).count()
print "Hoodi:",mongo_db.cities.find({"type":"node",
                                    'city':'hoodi'}).count()

In [1]:
Bunkyo: 1051170
Hoodi: 548862

Number of way nodes.


In [1]:
print "Bunkyo:",mongo_db.cities.find({'type':'way',
                                  'city':'bunkyo'}).count()
print "Hoodi:",mongo_db.cities.find({'type':'way',
                                  'city':'hoodi'}).count()

In [1]:
Bunkyo: 217122
Hoodi: 118980

Total Number of contributor.


In [1]:
print "Constributors:", len(mongo_db.cities.distinct("created.user"))

In [1]:
Contributors: 858

3. Additional Data Exploration using MongoDB

I am going to use the pipeline function to retrive data from the database


In [ ]:
def pipeline(city):
    p= [{"$match":{"created.user":{"$exists":1},
                                          "city":city}},
                 {"$group": {"_id": {"City":"$city",
                                     "User":"$created.user"},
                            "contribution": {"$sum": 1}}},                            
                 {"$project": {'_id':0,
                               "City":"$_id.City",
                               "User_Name":"$_id.User",
                               "Total_contribution":"$contribution"}},
                 {"$sort": {"Total_contribution": -1}},
                 {"$limit" : 5 }]
    return p
result1 =mongo_db["cities"].aggregate(pipeline('bunkyo'))
for each in result1:    
    print(each)
print("\n")
result2 =mongo_db["cities"].aggregate(pipeline('hoodi'))
for each in result2:    
    print(each)
Bunkyo: {u'City': u'bunkyo', u'User_Name': u'kurauchi', u'Total_contribution': 667425} {u'City': u'bunkyo', u'User_Name': u'watao', u'Total_contribution': 216855} {u'City': u'bunkyo', u'User_Name': u'higa4', u'Total_contribution': 40845} {u'City': u'bunkyo', u'User_Name': u'ikiya', u'Total_contribution': 37287} {u'City': u'bunkyo', u'User_Name': u'javbw', u'Total_contribution': 26655} Hoodi: {u'City': u'hoodi', u'User_Name': u'praveeng', u'Total_contribution': 75153} {u'City': u'hoodi', u'User_Name': u'akhilsai', u'Total_contribution': 70509} {u'City': u'hoodi', u'User_Name': u'anthony1', u'Total_contribution': 52194} {u'City': u'hoodi', u'User_Name': u'anushapyata', u'Total_contribution': 45540} {u'City': u'hoodi', u'User_Name': u'docaneesh', u'Total_contribution': 38703}

The top contributors for hoodi are no where near since bunkyo being a more compact region than hoodi ,there are more places to contribute.


To get the top Amenities in Hoodi and Bunkyo

I will be showing the pipeline that will go in the above mentioned "Pipleline" function


In [ ]:
pipeline=[{"$match":{"Additional Information.amenity":{"$exists":1},
                                          "city":city}},
                 {"$group": {"_id": {"City":"$city",
                                    "Amenity":"$Additional Information.amenity"},
                            "count": {"$sum": 1}}},
                 {"$project": {'_id':0,
                               "City":"$_id.City",
                               "Amenity":"$_id.Amenity",
                               "Count":"$count"}},
                 {"$sort": {"Count": -1}},
                 {"$limit" : 10 }]
Bunkyo: {u'Count': 1587, u'City': u'bunkyo', u'Amenity': u'parking'} {u'Count': 1497, u'City': u'bunkyo', u'Amenity': u'restaurant'} {u'Count': 933, u'City': u'bunkyo', u'Amenity': u'cafe'} {u'Count': 792, u'City': u'bunkyo', u'Amenity': u'fast_food'} {u'Count': 723, u'City': u'bunkyo', u'Amenity': u'school'} {u'Count': 606, u'City': u'bunkyo', u'Amenity': u'place_of_worship'} {u'Count': 525, u'City': u'bunkyo', u'Amenity': u'vending_machine'} {u'Count': 507, u'City': u'bunkyo', u'Amenity': u'bench'} {u'Count': 354, u'City': u'bunkyo', u'Amenity': u'pub'} {u'Count': 342, u'City': u'bunkyo', u'Amenity': u'kindergarten'} Hoodi: {u'Count': 528, u'City': u'hoodi', u'Amenity': u'restaurant'} {u'Count': 216, u'City': u'hoodi', u'Amenity': u'school'} {u'Count': 189, u'City': u'hoodi', u'Amenity': u'atm'} {u'Count': 162, u'City': u'hoodi', u'Amenity': u'pharmacy'} {u'Count': 162, u'City': u'hoodi', u'Amenity': u'parking'} {u'Count': 162, u'City': u'hoodi', u'Amenity': u'place_of_worship'} {u'Count': 141, u'City': u'hoodi', u'Amenity': u'bank'} {u'Count': 141, u'City': u'hoodi', u'Amenity': u'fast_food'} {u'Count': 138, u'City': u'hoodi', u'Amenity': u'hospital'} {u'Count': 102, u'City': u'hoodi', u'Amenity': u'cafe'}

As compared to hoodi ,bunkyo have few atms,And parking can be commonly found in bunkyo locality



In [ ]:
p = [{"$match":{"Additional Information.amenity":{"$exists":1},
                                  "Additional Information.amenity":"place_of_worship",
                                  "city":city}},
                       {"$group":{"_id": {"City":"$city",
                                          "Religion":"$Additional Information.religion"},
                                  "count":{"$sum":1}}},
                       {"$project":{"_id":0,
                                    "City":"$_id.City",
                                    "Religion":"$_id.Religion",
                                    "Count":"$count"}},
                       {"$sort":{"Count":-1}},
                       {"$limit":6}]
Bunkyo: {u'Count': 303, u'City': u'bunkyo', u'Religion': u'buddhist'} {u'Count': 132, u'City': u'bunkyo', u'Religion': u'shinto'} {u'Count': 123, u'City': u'bunkyo'} {u'Count': 39, u'City': u'bunkyo', u'Religion': u'christian'} {u'Count': 3, u'City': u'bunkyo', u'Religion': u'muslim'} {u'Count': 3, u'City': u'bunkyo', u'Religion': u'confucian'} Hoodi: {u'Count': 90, u'City': u'hoodi', u'Religion': u'hindu'} {u'Count': 30, u'City': u'hoodi', u'Religion': u'christian'} {u'Count': 24, u'City': u'hoodi'} {u'Count': 18, u'City': u'hoodi', u'Religion': u'muslim'}

As expected japan is popular with buddism,

but india being a secular country it will be having most of the reglious places of worship,where hinduism being majority



In [ ]:
p = [{"$match":{"Additional Information.amenity":{"$exists":1},
                                 "Additional Information.amenity":"restaurant",
                                 "city":city}},
                      {"$group":{"_id":{"City":"$city",
                                        "Food":"$Additional Information.cuisine"},
                                 "count":{"$sum":1}}},
                      {"$project":{"_id":0,
                                  "City":"$_id.City",
                                  "Food":"$_id.Food",
                                  "Count":"$count"}},
                      {"$sort":{"Count":-1}}, 
                      {"$limit":6}]

{u'Count': 582, u'City': u'bunkyo'} {u'Food': u'japanese', u'City': u'bunkyo', u'Count': 192} {u'Food': u'chinese', u'City': u'bunkyo', u'Count': 126} {u'Food': u'italian', u'City': u'bunkyo', u'Count': 69} {u'Food': u'indian', u'City': u'bunkyo', u'Count': 63} {u'Food': u'sushi', u'City': u'bunkyo', u'Count': 63}

{u'Count': 213, u'City': u'hoodi'} {u'Food': u'regional', u'City': u'hoodi', u'Count': 75} {u'Food': u'indian', u'City': u'hoodi', u'Count': 69} {u'Food': u'chinese', u'City': u'hoodi', u'Count': 36} {u'Food': u'international', u'City': u'hoodi', u'Count': 24} {u'Food': u'Andhra', u'City': u'hoodi', u'Count': 21}

Indian style cusine in Bunkyo seems famous, Which will be better if i go to japan and do my higher studies there.



In [ ]:
p = [{"$match":{"Additional Information.amenity":{"$exists":1},
                                 "Additional Information.amenity":"fast_food",
                                 "city":city}},
                      {"$group":{"_id":{"City":"$city",
                                        "Food":"$Additional Information.cuisine"},
                                 "count":{"$sum":1}}},
                      {"$project":{"_id":0,
                                  "City":"$_id.City",
                                  "Food":"$_id.Food",
                                  "Count":"$count"}},
                      {"$sort":{"Count":-1}}, 
                      {"$limit":6}]
{u'Count': 246, u'City': u'bunkyo'} {u'Food': u'burger', u'City': u'bunkyo', u'Count': 102} {u'Food': u'ramen', u'City': u'bunkyo', u'Count': 81} {u'Food': u'japanese', u'City': u'bunkyo', u'Count': 54} {u'Food': u'noodle', u'City': u'bunkyo', u'Count': 51} {u'Food': u'noodle;ramen', u'City': u'bunkyo', u'Count': 33} {u'Count': 66, u'City': u'hoodi'} {u'Food': u'pizza', u'City': u'hoodi', u'Count': 21} {u'Food': u'indian', u'City': u'hoodi', u'Count': 12} {u'Food': u'chicken', u'City': u'hoodi', u'Count': 6} {u'Food': u'ice_cream', u'City': u'hoodi', u'Count': 6} {u'Food': u'burger', u'City': u'hoodi', u'Count': 6}

Burger seems very popular among japanese in fast foods,i was expecting ramen to be more popular

, but in hoodi pizza is really common,being a metropolitan city.


ATM's near locality


In [ ]:
p = [{"$match":{"Additional Information.amenity":{"$exists":1},
                                 "Additional Information.amenity":"atm",
                                 "city":city}},
                      {"$group":{"_id":{"City":"$city",
                                        "Name":"$Additional Information.name:en"},
                                 "count":{"$sum":1}}},
                      {"$project":{"_id":0,
                                  "City":"$_id.City",
                                  "Name":"$_id.Name",
                                  "Count":"$count"}},
                      {"$sort":{"Count":-1}}, 
                      {"$limit":4}]
{u'Count': 75, u'City': u'bunkyo'} {u'Count': 6, u'City': u'bunkyo', u'Name': u'JP Bank'} {u'Count': 6, u'City': u'bunkyo', u'Name': u'Mizuho'} {u'Count': 177, u'City': u'hoodi'} {u'Count': 3, u'City': u'hoodi', u'Name': u'Axis bank ATM'} {u'Count': 3, u'City': u'hoodi', u'Name': u'HDFC'} {u'Count': 3, u'City': u'hoodi', u'Name': u'Canara'} {u'Count': 3, u'City': u'hoodi', u'Name': u'State bank ATM'}

There are quite a few ATM in Bunkyo as compared to hoodi


Martial arts or Dojo Center near locality


In [ ]:
##  Martial arts or Dojo Center near locality
import re

pat = re.compile(r'dojo', re.I)
d=mongo_db.cities.aggregate([{"$match":{ "$or": [ { "Additional Information.name": {'$regex': pat}}
                                                   ,{"Additional Information.amenity": {'$regex': pat}}]}}
                            ,{"$group":{"_id":{"City":"$city"
                             , "Sport":"$Additional Information.name"}}}])
for each in d:    
    print(each)

In [ ]:
bunkyo: 
{u'_id': {u'City': u'bunkyo', u'Sport': u'Aikikai Hombu Dojo'}}
{u'_id': {u'City': u'bunkyo', u'Sport': u'Kodokan Dojo'}}

hoodi:
{u'_id': {u'City': u'hoodi', u'Sport': u"M S Gurukkal's Kalari Academy"}}

I wanted to learn martial arts , In japan is known for its akido and other ninjistsu martial arts , where i can find some in bunkyo Where as in hoodi,india Kalaripayattu Martial Arts are one of the ancient arts that ever existed.



In [ ]:
p = [{"$match":{"Additional Information.shop":{"$exists":1},
                                          "city":city}},
                       {"$group":{"_id":{"City":"$city",
                                  "Shop":"$Additional Information.shop"},
                           "count":{"$sum":1}}},
                       {"$project": {'_id':0,
                                     "City":"$_id.City",
                                     "Shop":"$_id.Shop",
                                     "Count":"$count"}},
                       {"$sort":{"Count":-1}},
                       {"$limit":10}]

In [ ]:
{u'Shop': u'convenience', u'City': u'bunkyo', u'Count': 1035}
{u'Shop': u'clothes', u'City': u'bunkyo', u'Count': 282}
{u'Shop': u'books', u'City': u'bunkyo', u'Count': 225}
{u'Shop': u'mobile_phone', u'City': u'bunkyo', u'Count': 186}
{u'Shop': u'confectionery', u'City': u'bunkyo', u'Count': 156}
{u'Shop': u'supermarket', u'City': u'bunkyo', u'Count': 150}
{u'Shop': u'computer', u'City': u'bunkyo', u'Count': 126}
{u'Shop': u'hairdresser', u'City': u'bunkyo', u'Count': 90}
{u'Shop': u'electronics', u'City': u'bunkyo', u'Count': 90}
{u'Shop': u'anime', u'City': u'bunkyo', u'Count': 90}


{u'Shop': u'clothes', u'City': u'hoodi', u'Count': 342}
{u'Shop': u'supermarket', u'City': u'hoodi', u'Count': 129}
{u'Shop': u'bakery', u'City': u'hoodi', u'Count': 120}
{u'Shop': u'shoes', u'City': u'hoodi', u'Count': 72}
{u'Shop': u'furniture', u'City': u'hoodi', u'Count': 72}
{u'Shop': u'sports', u'City': u'hoodi', u'Count': 66}
{u'Shop': u'electronics', u'City': u'hoodi', u'Count': 60}
{u'Shop': u'beauty', u'City': u'hoodi', u'Count': 54}
{u'Shop': u'car', u'City': u'hoodi', u'Count': 36}
{u'Shop': u'convenience', u'City': u'hoodi', u'Count': 36}

In [ ]:
The general stores are quite common in both the places

In [ ]:
p = [{"$match":{"Additional Information.shop":{"$exists":1},
                           "city":city,
                           "Additional Information.shop":"supermarket"}},
                       {"$group":{"_id":{"City":"$city",
                                  "Supermarket":"$Additional Information.name"},
                           "count":{"$sum":1}}},
                       {"$project": {'_id':0,
                                     "City":"$_id.City",
                                     "Supermarket":"$_id.Supermarket",
                                     "Count":"$count"}},
                       {"$sort":{"Count":-1}},
                       {"$limit":5}]

In [ ]:
{u'Count': 120, u'City': u'bunkyo'}
{u'Count': 9, u'City': u'bunkyo', u'Supermarket': u'Maruetsu'}
{u'Count': 3, u'City': u'bunkyo', u'Supermarket': u"Y's Mart"}
{u'Count': 3, u'City': u'bunkyo', u'Supermarket': u'SainE'}
{u'Count': 3, u'City': u'bunkyo', u'Supermarket': u'DAIMARU Peacock'}


{u'Count': 9, u'City': u'hoodi', u'Supermarket': u'Reliance Fresh'}
{u'Count': 9, u'City': u'hoodi'}
{u'Count': 6, u'City': u'hoodi', u'Supermarket': u"Nilgiri's"}
{u'Count': 3, u'City': u'hoodi', u'Supermarket': u'Royal Mart Supermarket'}
{u'Count': 3, u'City': u'hoodi', u'Supermarket': u'Safal'}

These are few common supermarket brands in both the cities And Nilgiris is like 500 meters away from my home.


4. Conclusion

After such a investigation on this data i think i have become familiar with bunkyo region.

I was expecting a difficulty in merging both the cities data into a single database ,but seem a simple key like city could differentiate them.

There might be even robust cleaning algorithms to a better and clean database,as most of the data is from gps that goes into OpenStreetMap.org. Which needed to be regularly cleaned.

From the comparision of both the cities these are qiute similar and bunkyo region interests me even more to pursue higher studies.